We present Second Thought, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain-of-edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, Second Thought not only achieves superior performance in three value alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and ease for interactive error correction. Extensive human evaluations further confirm its effectiveness.
translated by 谷歌翻译
The deep learning community has witnessed an exponentially growing interest in self-supervised learning (SSL). However, it still remains unexplored how to build a framework for learning useful representations of raw music waveforms in a self-supervised manner. In this work, we design Music2Vec, a framework exploring different SSL algorithmic components and tricks for music audio recordings. Our model achieves comparable results to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller with less than 2% of parameters of the latter. The model will be released on Huggingface(Please refer to: https://huggingface.co/m-a-p/music2vec-v1)
translated by 谷歌翻译
The goal of 3D pose transfer is to transfer the pose from the source mesh to the target mesh while preserving the identity information (e.g., face, body shape) of the target mesh. Deep learning-based methods improved the efficiency and performance of 3D pose transfer. However, most of them are trained under the supervision of the ground truth, whose availability is limited in real-world scenarios. In this work, we present X-DualNet, a simple yet effective approach that enables unsupervised 3D pose transfer. In X-DualNet, we introduce a generator $G$ which contains correspondence learning and pose transfer modules to achieve 3D pose transfer. We learn the shape correspondence by solving an optimal transport problem without any key point annotations and generate high-quality meshes with our elastic instance normalization (ElaIN) in the pose transfer module. With $G$ as the basic component, we propose a cross consistency learning scheme and a dual reconstruction objective to learn the pose transfer without supervision. Besides that, we also adopt an as-rigid-as-possible deformer in the training process to fine-tune the body shape of the generated results. Extensive experiments on human and animal data demonstrate that our framework can successfully achieve comparable performance as the state-of-the-art supervised approaches.
translated by 谷歌翻译
由于其广泛的应用,尤其是在现场理解领域,因此在3D点云上进行的实例细分一直在吸引越来越多的关注。但是,大多数现有方法都需要完全注释培训数据。在点级的手动准备地面真相标签非常繁琐且劳动密集型。为了解决这个问题,我们提出了一种新颖的弱监督方法RWSEG,该方法仅需要用一个点标记一个对象。有了这些稀疏的标签,我们使用自我注意事项和随机步行引入了一个带有两个分支的统一框架,分别将语义和实例信息分别传播到未知区域。此外,我们提出了一个跨画竞争的随机步行(CGCRW)算法,该算法鼓励不同实例图之间的竞争以解决紧密放置对象中的歧义并改善实例分配的性能。 RWSEG可以生成定性实例级伪标签。 Scannet-V2和S3DIS数据集的实验结果表明,我们的方法通过完全监督的方法实现了可比的性能,并且通过大幅度优于先前的弱监督方法。这是弥合该地区弱和全面监督之间差距的第一项工作。
translated by 谷歌翻译
3D姿势传输是最具挑战性的3D生成任务之一。它旨在将源网的姿势传递到目标网格,并保持目标网格的身份(例如,体形)。某些以前的作品需要关键点注释来构建源网格和目标网格之间的可靠对应,而其他方法不考虑源和目标之间的任何形状对应,这导致了有限的发电质量。在这项工作中,我们提出了一种通信细化网络,以帮助为人类和动物网格进行3D姿势转移。首先通过解决最佳运输问题来建立源网和目标网格之间的对应关系。然后,我们根据密集的对应探讨源网格并获得粗糙的翘曲网格。通过我们提出的弹性实例标准化,翘曲的网格将更好地精制,这是一个条件归一化层,可以帮助产生高质量网格。广泛的实验结果表明,所提出的架构可以有效地将源从源转移到目标网格,并提供比最先进的方法满意的视觉性能更好的结果。
translated by 谷歌翻译
Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text to speed up the process of editing speech. In the previous work, CampNet (context-aware mask prediction network) is proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper aims at a new task: adding emotional effect to the editing speech during the text-based speech editing to make the generated speech more expressive. To achieve this task, we propose Emo-CampNet (emotion CampNet), which can provide the option of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. Firstly, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of generated speech by introducing additional emotion attributes based on the context-aware mask prediction network. Secondly, to prevent the emotion of the generated speech from being interfered by the emotional components in the original speech, a neutral content generator is proposed to remove the emotion from the original speech, which is optimized by the generative adversarial framework. Thirdly, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, which can enable the model to edit the unseen speaker's speech. The experimental results that 1) Emo-CampNet can effectively control the emotion of the generated speech in the process of text-based speech editing; And can edit unseen speakers' speech. 2) Detailed ablation experiments further prove the effectiveness of emotional selectivity and data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/
translated by 谷歌翻译
Previous databases have been designed to further the development of fake audio detection. However, fake utterances are mostly generated by altering timbre, prosody, linguistic content or channel noise of original audios. They ignore a fake situation, in which the attacker manipulates an acoustic scene of the original audio with another forgery one. It will pose a major threat to our society if some people misuse the manipulated audio with malicious purpose. Therefore, this motivates us to fill in the gap. This paper designs such a dataset for scene fake audio detection (SceneFake). A manipulated audio in the SceneFake dataset involves only tampering the acoustic scene of an utterance by using speech enhancement technologies. We can not only detect fake utterances on a seen test set but also evaluate the generalization of fake detection models to unseen manipulation attacks. Some benchmark results are described on the SceneFake dataset. Besides, an analysis of fake attacks with different speech enhancement technologies and signal-to-noise ratios are presented on the dataset. The results show that scene manipulated utterances can not be detected reliably by the existing baseline models of ASVspoof 2019. Furthermore, the detection of unseen scene manipulation audio is still challenging.
translated by 谷歌翻译
进行了许多有效的尝试进行了DeepFake音频检测。但是,他们只能区分真实和假货。对于许多实际的应用程序方案,还需要哪种工具或算法生成DeepFake音频。这提出了一个问题:我们可以检测到DeepFake音频的系统指纹吗?因此,本文进行了初步研究,以检测DeepFake音频的系统指纹。实验是从五个最新的深入学习语音合成系统的DeepFake音频数据集上进行的。结果表明,LFCC功能相对适合系统指纹检测。此外,RESNET在基于LCNN和X-Vector模型中获得了最佳检测结果。T-SNE可视化表明,不同的语音合成系统会生成不同的系统指纹。
translated by 谷歌翻译
已经进行了许多有效的尝试来进行虚假的音频检测。但是,他们只能提供检测结果,但没有对抗这种伤害的对策。对于许多相关的实际应用,也需要哪种模型或算法生成假音频。因此,我们提出了一个新问题,用于检测虚假音频的Vocoder指纹。实验是在由八个最先进的歌手合成的数据集上进行的。我们已经初步探索了功能和模型体系结构。T-SNE可视化表明,不同的Vocoder会生成不同的Vocoder指纹。
translated by 谷歌翻译
现有的假音频检测系统通常依靠专家经验来设计声学功能或手动设计网络结构的超参数。但是,人工调整参数可能会对结果产生相对明显的影响。几乎不可能手动设置最佳参数集。因此,本文提出了一种完全自动化的终端伪造音频检测方法。我们首先使用WAV2VEC预训练模型来获得语音的高级表示。此外,对于网络结构,我们使用了名为Light-Darts的可区分体系结构搜索(飞镖)的修改版本。它学习了深厚的语音表示,同时自动学习和优化包括卷积操作和残留块组成的复杂神经结构。 ASVSPOOF 2019 LA数据集的实验结果表明,我们提出的系统达到的错误率(EER)为1.08%,这表现优于最先进的单个系统。
translated by 谷歌翻译